Metagenomics is a primary tool for the description of microbial and viral communities. 
The sheer magnitude of the data generated in each metagenome makes identifying key 
differences in the function and taxonomy between communities difficult to elucidate. 
Here we discuss the application of seven different data mining and statistical analyses 
by comparing and contrasting the metabolic functions of 212 microbial metagenomes 
within and between 10 environments. Not all approaches are appropriate for all ques- 
tions, and researchers should decide which approach addresses their questions. This work 
demonstrated the use of each approach: for example, random forests provided a robust 
and enlightening description of both the clustering of metagenomes and the metabolic 
processes that were important in separating microbial communities from different envi- 
ronments. All analyses identified that the presence of phage genes within the microbial 
community was a predictor of whether the microbial community was host-associated or 
free-living. Several analyses identified the subtle differences that occur with environments, 
such as those seen in different regions of the marine environment. 

Keywords: metagenomics, statistics, microbiology, random forest, canonical discriminant analysis, principal 
component analysis 



INTRODUCTION 

Vast communities of microbes occupy every environment, con- 
suming and producing compounds that shape the local geochem- 
istry. Over the last several years sequence based approaches have 
been developed for the large-scale analysis of microbial com- 
munities. This technique, typically called metagenomics, involves 
extracting and sequencing the DNA en masse, and then using high 
performance computational analysis to assign a function to each
sequence. Annotation of a metagenome is conducted by comparing
the sample DNA to that available in various databases, such
as NCBI, SEED, MG-RAST, or COG (Wooley et al., 2010). The 
number of sequences similar to each protein is identified; there- 
fore a metagenome provides information on the taxonomic make 
up and metabolic potential of a microbial community. 

Most of the focus in metagenomics has been on single environments
such as coral atolls (Wegley et al., 2007; Dinsdale et al.,
2008b), cow intestines (Brulc et al., 2009), ocean water (Angly
et al., 2006), and microbialites (Breitbart et al., 2009). Early work 
compared extremely different environments, like soil microbes 
compared to water microbes (Tringe et al., 2005). More recently, 
the Human Microbiome Project has expanded our understanding 
of the microbes inhabiting our own bodies, comparing samples 
from the same site among and between individuals (Kurokawa 
et al., 2007; Turnbaugh et al., 2007, 2009). These studies reflect
the dynamic and expanding field of metagenomics which has 
been reviewed elsewhere (Wooley et al., 2010). Previously, we 
demonstrated that analysis of functional diversity in metagenomes 
could differentiate the microbial processes occurring in multiple 
environments (Dinsdale et al., 2008a). That study utilized the 
only publicly available metagenomes at that time: 45 microbial 
samples and 42 viral samples. The raw DNA sequences were compared
to the SEED subsystems (Overbeek et al., 2005), and the
normalized proportion of sequences in each subsystem in each
metagenome was used as the input. That provided a raw data
set with 23 response variables and 87 observations (45 microbial 
metagenomes and 42 viral metagenomes) or samples. In that first 
study, a canonical discriminant analysis (CDA) was used on a low 
number of samples from highly disparate environments. In this 
analysis, we describe a wider range of statistical analyses and use 
a larger sample size, to describe the abilities of metagenomes to 
describe the metabolic profile of microbial communities. 

Even though metagenomics provides a complete analysis of the 
microbial activity, the results are complicated to interpret because 
a typical output is a list of BLAST matches to many thousands 
of proteins. Some programs for testing significance levels between 
metagenomes have been written and most use bootstrapping to 
avoid problems associated with the low number of replicates 
(Rodriguez-Brito et al., 2006; Parks and Beiko, 2010). Web based
sites are being created which enable researchers to conduct statis- 
tical analysis, with no explanation of the suitability of the analysis 
(Arndt et al., 2012). The most common question biologists pose
when conducting a metagenomic analysis is how the microbial 
community taxa or metabolic potential vary between sampling 
locations or time points. To answer this question requires the 
analysis and visualization of large amounts of multivariate data. 
To date, a few statistical tests are routinely used, including princi- 
pal component analysis (PCA), multidimensional scaling (MDS), 
and CDA, similar to more traditional analyses of microbial com- 
munities and genomic data where PCA dominates the analyses 
(Ramette, 2007). 

There are many statistical tools that can be used to explore mul- 
tivariate data as provided by metagenomes. Here we provide an 
overview of seven different statistical techniques, out of the many 
that could be used, to compare and contrast metagenomes from 
different environments. In particular, we focus on tools for the 
classification and visualization of metagenomic data. In this work, 
we are concerned with how metabolic potential of the microbial 
community varies within and between environments. 

It is important to realize that the statistical tests used will 
depend on the question the researcher is exploring. Not every sta- 
tistical test should be used for every analysis, but several analyses 
can be used in combination to answer the same research question.
For example, random forests are a robust analysis, but do not
provide a good visualization of the data. Therefore, we combine
random forest analysis with either MDS or CDA to visualize
the outcome of the random forest. In this work, we have focused 
on clustering and visualization to show how metagenomes vary 
between and within environments and identify the metabolic 
processes that are important in driving the separations. A detailed 
analysis of the relationship between multivariate analyses can be 
found in Ramette (2007). Here we take a metagenome-centric
view, briefly introduce each statistical method, and describe
its ability to separate metagenomes across environmental space. 

The analysis recapitulated the discriminating power of metage- 
nomics to identify differences in functional potential both between 
and within environments. A unique metabolic signature repre- 
sented each environmental microbial community: for example, 
the abundance of phage proteins was the major discriminator 
between host-associated microbial environments and free-living 
microbes. Subtle differences between open and coastal marine 
environments were associated with differences in the abundance 
of photosynthetic proteins. Cofactors, vitamins, and stress related 
proteins were consistently found in higher abundance in environ- 
ments where the conditions for microbial survival were potentially 
unstable, such as hydrothermal springs. Each of these differ- 
ences provides a clue for detailed microbiological analysis of 
communities. 

MATERIALS AND METHODS 

At the time of analysis, 212 metagenomes were selected from the
set of publicly available data (footnote 1). They were classified into 10
different environments depending on the description provided by the
researcher who collected the samples (Table A1 in Appendix). The
metagenomes spanned a range of sequencing technologies, and 
most environments were represented by two or more sequencing 
technologies (Figure 1). The sample descriptions were provided
as a geographical coordinate or a verbal description (e.g., coral
reef water); these were translated into the environmental ontology,
EnvO (Smith et al., 2006). EnvO environments were: saline evaporation
pond; mat community; hydrothermal springs; human associated;
other terrestrial animal associated; freshwater; and marine.
Because of the abundance of samples from saline hydrographic 
features from the ocean (for example, Global Ocean Survey data), 
these samples were further sub-divided into four groups: open 
ocean, coastal water, deep water, and coral-reef water associated 
samples. The descriptions of metagenomes were mostly a geo- 
graphic location, which would place the sample in a clear habitat 
type; a description of host, e.g., human or animal type; or a verbal 
description of the habitat, e.g., hydrothermal springs. There is an 
unfortunate lack of auxiliary data, e.g., measurements of salinity, 
pH, temperature, that could be used to separate the samples along 
a gradient. As more environmental measurements are collected at 
the time of metagenome sampling, the two data types (environ- 
mental and genomic) can be analyzed simultaneously to provide 
direct evidence of how microbial communities differ across
environmental gradients, and some of the statistics that we present will
be useful for these analyses.



1 http://edwards.sdsu.edu/mymgdb/ 
FIGURE 1 | A comparison of the sequence length and number of sequences across the environmental groups. 



Publicly available metagenomes were selected from the Edwards
Lab metagenome database (see text footnote 1; Table A1 in
Appendix). All samples were annotated using the real-time k-mer
based annotation system using a 10-amino acid word size and a
requirement for at least two words per protein (footnote 2). Real-time
metagenomics uses signature k-mers to identify the functions encoded in
the metagenome sample (Edwards et al., 2012). The k-mer based
approach allows all of the samples to be annotated against the same
core database, and for the annotations to be updated whenever
required. The k-mer based annotation provides the number of
sequences for each function, subsystem, and two level hierarchies
in the subsystems ontology (Henry et al., 2011). This system works
by comparing the DNA to previously annotated DNA housed in 
a range of databases which identifies a gene or subsystem that 
shows similarity. The gene is then grouped with other genes that 
contribute to a metabolic pathway. The pathways are grouped 
with pathways that are associated with similar metabolic functions
to make the top hierarchical metabolic function. For example,
a sequence may be similar to Alanine racemase, which is used in 
Alanine Biosynthesis, which is one of the pathways in Amino acid 
metabolism; therefore in this case the microbial community would 
have a sequence in the Amino acid metabolism subsystem. The 
counts for each metabolic process are totaled and normalized by 
the total number of sequences that show similarity to any subsystem.
Therefore the analyses used the percent of sequences in each
metabolic or functional group as the data; the metabolic groups are
the response variables and the metagenomes are the observations.
The 27 functional hierarchies used in the analysis were: Amino 



2 http://edwards.sdsu.edu/rtmg 



Acids and Derivatives; Carbohydrates; Cell Division and Cell Cycle; 
Cell Wall and Capsule; Cofactors, Vitamins, Prosthetic Groups, and 
Pigments; DNA Metabolism; Dormancy and Sporulation; Fatty 
Acids, Lipids, and Isoprenoids; Membrane Transport; Metabolism 
of Aromatic Compounds; Miscellaneous; Motility and Chemo- 
taxis; Nitrogen Metabolism; Nucleosides and Nucleotides; Phages, 
Prophages, and Transposable Elements; Phosphorus Metabolism; 
Photosynthesis; Plasmids; Potassium Metabolism; Protein Metab- 
olism; Regulation and Cell Signaling; Respiration; RNA Metabo- 
lism; Secondary Metabolism; Stress Response; Sulfur Metabolism; 
Virulence (Aziz et al., 2008). 
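The normalization step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the annotation pipeline itself: the function name `normalize_counts` and the example counts are hypothetical.

```python
def normalize_counts(counts):
    """Convert raw subsystem hit counts for one metagenome into percentages.

    `counts` maps a functional group name to the number of sequences
    assigned to it. The returned values sum to 100 (up to rounding error).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sequences were assigned to any subsystem")
    return {group: 100.0 * n / total for group, n in counts.items()}


# Hypothetical counts for a single metagenome (illustrative values only).
example = {"Carbohydrates": 300, "Photosynthesis": 50, "Stress Response": 150}
profile = normalize_counts(example)
print(profile)
```

Applying the function to every metagenome gives one row of percentages per observation, with the functional groups as the response variables.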

Common statistical techniques were used to explore the rela- 
tionship between the metagenomes, environments, and subsys- 
tems (Figure 2). The two key questions addressed were: (i) do 
metagenomes have a metabolic signature for each environment 
and (ii) what are the important metabolic processes driving that 
signature? Clustering analysis is useful for grouping objects into 
categories based on their dissimilarities and works well when there
are discontinuities in the samples, i.e., they are collected from
distinct environments, rather than where continuous differences 
are expected, i.e., they are collected along a single environmen- 
tal gradient. In general, statistical methods can be divided into 
two broad categories: supervised techniques and unsupervised 
techniques. Supervised techniques require that the samples be sep- 
arated into predetermined groups before the analysis begins, and 
those groups are used as part of the analytical methods. In this case, 
the metagenome samples were grouped according to the environ- 
ment where the sample was collected. In contrast, unsupervised 
techniques do not require a priori knowledge of the group sepa- 
rations, but the groups are generated by the statistical technique. 
In all cases, we compare the resultant groups to the original



www.frontiersin.org 



April 2013 | Volume 4 | Article 41 [3 



Dinsdale et al. 



Multivariate analysis of functional metagenomes 



[Figure 2 node labels: Predetermined groups (No / Yes); Clustering; Visualization; Unsupervised Random Forest; Partitioning Around Medoid; Principal Component Analysis; Non-metric Multidimensional Scaling; K-means Clustering; Linear Discriminant Analysis; Leave One Out Validation; Random Forest; Classification Tree; Mean Decreasing Accuracy; Mean Decreasing Gini; Canonical Discriminant Analysis; Cross Validation]

FIGURE 2 | A diagram of the relationship between the seven statistical methods evaluated. 



sampled environment to determine the discriminating power of 
the analysis. 

When categorizing data, many statistical methods are prone to 
over-fitting the data - reading more into the data than is really 
there. To reduce the problem of over-fitting, the size of the data
sets should be increased, groups should be of similar size, and the
number of groups should be less than the number of variables.
Sample size considerations are particularly relevant to metage- 
nomic data analysis, due to the nature of the data. There are 
thousands of proteins identified in each metagenome, but at the 
time of analysis there were <300 publicly available samples, meaning
that there were far fewer samples than potential variables.
Combining the proteins into functional groupings reduces the
number of variables to be less than the number of samples available
(subsystems were used here, but other groups like COGs,
KOGs, or PFAMs are also widely used for metagenome analysis;
Reyes et al., 2010). The subsystem approach is standardized
and identifies all the proteins that are within a metabolic group. 
We used BLAST to identify how many sequences are similar to 
each protein. The data consisted of 10 classifications (the environ- 
ments), 27 response variables (the functional metabolic groups), 
and 212 observations (the metagenomes). As the number of pub- 
licly available metagenomes increases the number of metabolic 
groups could be increased. We compared the outcomes of the seven
statistical analyses; the detailed methods are discussed below,
and further discussion and source code for all of these operations
are provided in the online accompanying material (footnote 3). A brief
summary of each method is given in the results.



3 http://dinsdalelab.sdsu.edu/metag.stats/ 



K-MEANS CLUSTERING 

K-means clustering is an unsupervised method which aims to classify
observations into K groups, for a choice of K. This approach
partitions observations into clusters in order to minimize the 
sum of squared distances from each observation to the mean of 
its assigned group. The function that is minimized is called the 
objective function described in Eq. 1:

$$\mathrm{obj}(\mu_1, \ldots, \mu_K) = \sum_{i=1}^{n} \min_{k} \lVert x^{(i)} - \mu_k \rVert^2 \qquad (1)$$

where $x^{(i)}$ is an observation, $\mu_1, \ldots, \mu_K$ are the means, and the
minimum selects the group $k$ whose mean is closest to $x^{(i)}$. The
result is K clusters where each observation belongs to the cluster
with the closest mean.

The K-means algorithm starts by randomly selecting $\mu_1, \ldots, \mu_K$
and placing all observations into groups based on minimizing
the objective function using Euclidean distance. The group means
are then recalculated using the observations in each cluster and
replace the previous means, $\mu_1, \ldots, \mu_K$. The algorithm is repeated
until additional runs no longer modify the group means or the
partitioning of observations.
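The iteration described above can be sketched with the Python standard library. This is a minimal illustration rather than the code used for the analysis: the function names, the choice to start from K randomly chosen observations, the number of restarts, and the toy data are all assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(points, k, rng, max_iter=100):
    """One run of K-means from a random start; returns (objective, labels, means)."""
    means = rng.sample(points, k)  # assumption: start from k randomly chosen observations
    labels = None
    for _ in range(max_iter):
        # Assign every observation to its closest mean.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, means[j])) for p in points]
        if new_labels == labels:  # partition unchanged, so the run has converged
            break
        labels = new_labels
        # Recalculate each group mean from its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous mean if a group is empty
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    objective = sum(sq_dist(p, means[lab]) for p, lab in zip(points, labels))
    return objective, labels, means


def kmeans(points, k, restarts=20, seed=0):
    """Repeat K-means from several random starts and keep the lowest objective."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(restarts)), key=lambda r: r[0])


# Toy data: two well separated groups of three observations each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
objective, labels, means = kmeans(points, k=2)
print(objective, labels)
```

Keeping the best of several restarts is one simple way to reduce the risk of reporting a poor local minimum.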

An alternative method of choosing K uses silhouettes (Marden,
2008), which test how well an observation fits into the cluster
it has been partitioned into rather than the next nearest cluster.
Silhouettes give a good indication of how spread out groups are
from each other. Let $a(i) = \lVert x^{(i)} - \mu_k \rVert$ and $b(i) = \lVert x^{(i)} - \mu_l \rVert$,
where $x^{(i)}$ is an observation in group $k$ and $l$ is the group with
the next closest mean (Marden, 2008). A silhouette is then defined
in Eq. 2:

$$\mathrm{silhouette}(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \qquad (2)$$



Ideally, each observation is much closer to the mean of its group 
than to the mean of any other group. In this case, the silhouette 
would be close to 1. Similar to the sum of squares plot, one must 
be careful about choosing a minimal K which has a large aver- 
age silhouette width, though silhouette graphs frequently suggest 
a clear K to select. 
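A short sketch of the silhouette calculation follows. It assumes the mean-based definition given in the text (distance to the observation's own group mean versus the next closest group mean) and the sign convention under which a well placed observation scores close to 1. The function name and toy data are hypothetical.

```python
import math


def silhouettes(points, labels, means):
    """Silhouette of each observation from distances to group means."""
    out = []
    for p, lab in zip(points, labels):
        a = math.dist(p, means[lab])  # distance to own group mean
        b = min(math.dist(p, m) for j, m in enumerate(means) if j != lab)  # next closest mean
        denom = max(a, b)
        out.append(0.0 if denom == 0 else (b - a) / denom)
    return out


# Toy example: two groups on a line with means at 1 and 11.
points = [(0, 0), (2, 0), (10, 0), (12, 0)]
labels = [0, 0, 1, 1]
means = [(1, 0), (11, 0)]
s = silhouettes(points, labels, means)
print(sum(s) / len(s))  # average silhouette width
```

Computing the average silhouette width for several values of K and comparing them is one way to apply the selection rule described above.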

CROSS-VALIDATION OF CLASSIFICATION TREE 

To cross validate a tree, the data set is divided into k randomly 
selected groups of near equal size. A large tree is built using the 
data points in only k — 1 groups and pruned to give a sequence 
of subtrees. The tree and subtrees are used to predict the classes 
of the remaining data points, and these predictions are compared 
against the actual classes of those data points. The misclassification 
rate and the cross-validated deviance estimate are computed for 
each tree, and the process is repeated for each group. This k-fold
cross-validation procedure (Shi and Horvath, 2006) is typically
repeated many times, so that different subsets are selected in each
trial. The misclassification and deviance values for each tree size
are averaged over the repetitions, and the subtree that minimizes
the standard error in the misclassification rate, or that has the
lowest average deviance, is selected. Trees constructed using cross-validation tools
are typically less susceptible to over-fitting than other forms of 
classification. K-fold cross-validation is particularly appropriate 
for metagenomic data where there may be few samples in some of 
the environmental groups and as many samples as possible should 
be used to identify the right tree. 
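The k-fold procedure can be sketched independently of the tree-growing code. In the illustration below the classifier is passed in as two hypothetical callables, `fit` and `predict`; the demonstration uses the smallest possible subtree (a root-only tree that always predicts the majority class) as a stand-in, and the labels are invented.

```python
import random
from collections import Counter


def kfold_indices(n, k, rng):
    """Split the indices 0..n-1 into k random folds whose sizes differ by at most one."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[f::k] for f in range(k)]


def cv_error(X, y, k, fit, predict, seed=0):
    """Cross-validated misclassification rate for a classifier given as fit/predict callables."""
    rng = random.Random(seed)
    wrong = 0
    for fold in kfold_indices(len(X), k, rng):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        wrong += sum(predict(model, X[i]) != y[i] for i in fold)
    return wrong / len(X)


# Stand-in classifier: a root-only tree that predicts the majority class.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    return model


X = [(float(i),) for i in range(10)]        # placeholder predictor values
y = ["marine"] * 8 + ["freshwater"] * 2     # hypothetical environment labels
error = cv_error(X, y, k=5, fit=fit_majority, predict=predict_majority)
print(error)
```

Repeating `cv_error` with different seeds and averaging the results corresponds to the repeated trials described above.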

SUPERVISED RANDOM FOREST OUT OF BAGGING DESCRIPTION 

Sampling the data with replacement generates a new dataset to 
grow each tree in the forest - a process called bagging (bootstrap
aggregating). The metagenomes that are chosen at least once during
the sampling process are considered in-bag for the resulting
tree, while the remaining metagenomes are considered out-of-bag 
(OOB). Upon mature growth of the forest, each metagenome will 
be OOB for a subset of the trees: that subset is used to predict 
the class of the metagenome. If the predicted class does not match 
the original given class, the OOB error is increased. A low OOB 
error means the forest is a strong predictor of the environments 
that the metagenomes come from. Misclassifications contributing 
to the OOB errors are displayed in a confusion matrix. The rows 
in the confusion matrix represent the classes of the metagenomes 
and the columns represent the classes predicted by the subsets of 
the trees for which each metagenome was OOB. Each class error, 
weighted for class size, contributes to the single OOB error. The 
OOB error and a confusion matrix are used to judge the misclassi- 
fication error and clarify where the errors occur, while the variable 
importance measure allows for identifying which variables are best 
at discriminating among groups. 
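The bookkeeping behind bagging and the confusion matrix can be sketched as follows. This is an illustration only: the function names are hypothetical, no trees are grown, and the labels in the confusion matrix example are invented.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; return the in-bag draws and the out-of-bag indices."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob


def confusion_matrix(actual, predicted, classes):
    """Rows are the actual classes, columns are the predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


rng = random.Random(42)
in_bag, oob = bootstrap_split(212, rng)
# For large n the expected out-of-bag fraction is (1 - 1/n)^n, roughly 0.37.
print(len(oob) / 212)

cm = confusion_matrix(
    actual=["marine", "marine", "human"],
    predicted=["marine", "human", "human"],
    classes=["marine", "human"],
)
print(cm)
```

Off-diagonal entries of the matrix show which environments are confused with one another.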

MEAN DECREASING ACCURACY AND GINI IN SUPERVISED RANDOM 
FOREST 

There are several approaches that work in conjunction with random
forests to estimate the importance of variables in separating
the data into groups. One uses the mean decrease in accuracy
that a variable causes, which is determined during the OOB error
calculation phase. The values of a particular variable are randomly
permuted among the set of OOB metagenomes. Then the OOB 
error is computed again. The more the accuracy of the random 
forest decreases due to the permutation of values of this variable, 
the more important the variable is deemed. 
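The permutation idea can be sketched with a single classifier and a single held-out set. This simplifies the random forest procedure, where the permutation is applied tree by tree to each tree's OOB metagenomes. The classifier `toy_predict`, the function names, and the data are hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Fraction of observations whose predicted class matches the given class."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, var, rng):
    """Decrease in accuracy after randomly permuting the values of one variable."""
    column = [row[var] for row in X]
    rng.shuffle(column)
    permuted = [row[:var] + (value,) + row[var + 1:] for row, value in zip(X, column)]
    return accuracy(predict, X, y) - accuracy(predict, permuted, y)


# Stand-in classifier that only looks at variable 0.
def toy_predict(row):
    return "host-associated" if row[0] > 0.5 else "free-living"


X = [(0.9, 0.1), (0.8, 0.7), (0.7, 0.3), (0.2, 0.9), (0.1, 0.4), (0.3, 0.6)]
y = ["host-associated"] * 3 + ["free-living"] * 3
rng = random.Random(1)
print(permutation_importance(toy_predict, X, y, 0, rng))  # variable the classifier uses
print(permutation_importance(toy_predict, X, y, 1, rng))  # variable the classifier ignores
```

Permuting a variable that the classifier never uses leaves the accuracy unchanged, so its importance is zero.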

The mean decrease in Gini is a measure of how a variable contributes
to the homogeneity of nodes and leaves in the Random
Forest. Let $p_{m g_i}$ be the proportion of samples of group $g_i$ in node
$m$. Let $g_c$ be the most plural group in node $m$. The Gini index of
node $m$, $G_m$, is defined in Eq. 3:

$$G_m = 1 - \sum_{i} p_{m g_i}^{2} \qquad (3)$$

The Gini index is a measure of the purity of the node, with
smaller values indicating a purer node and thus a lesser likelihood
of misclassification (Breiman et al., 1984). Tree generating algorithms
may use this index as their criterion to pick which variable
to split on. Each time a particular variable is used to split a node,
the Gini indexes for the child nodes are calculated and compared
to that of the original node. When node $m$ is split into $m_r$ and $m_l$,
there is a probability $p_{m_r}$ of samples going into the child node $m_r$
and $p_{m_l}$ of going into $m_l$. The decrease (Breiman et al., 1984) in
Gini is defined in Eq. 4:

$$D_m = G_m - p_{m_r} G_{m_r} - p_{m_l} G_{m_l} \qquad (4)$$

The calculated decrease is added to the mean decrease Gini
for the splitting variable and normalized at the end. The greater
the mean decrease Gini of a variable, the purer the nodes it produces when splitting.
Each time a particular variable is used to split a node, the Gini coef- 
ficients for the child nodes are calculated and compared to that of 
the original node. The Gini coefficient is a measure of homogene- 
ity from 0 (homogenous) to 1 (heterogenous). The decreases in 
Gini are summed for each variable and normalized at the end of 
the calculation. Variables that split nodes into nodes with higher 
purity have a higher decrease in Gini coefficient. 
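A worked example may help make the Gini calculation concrete. The sketch below assumes the common form of the index, one minus the sum of squared group proportions, and weights each child node by the fraction of samples it receives. The function names and labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 minus the sum of squared group proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Decrease in Gini when a parent node is split into two child nodes."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)


# A node holding two groups in equal numbers, split perfectly by some variable.
parent = ["marine"] * 4 + ["human"] * 4
left, right = parent[:4], parent[4:]
print(gini(parent))                        # 0.5
print(gini_decrease(parent, left, right))  # 0.5
```

A split that leaves both child nodes as mixed as the parent would give a decrease of zero.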

MULTIDIMENSIONAL SCALING 

Multidimensional scaling is a visualization technique. Its goal is 
similar to PCA (see below). MDS takes for its input an n x n dis- 
similarity matrix S for n metagenomes, constructed by some other 
statistical technique, such as random forest. Then the algorithm 
looks for an embedding of the data points into some lower (such 
as 2 or 3) dimensional space that preserves the dissimilarity dis- 
tances as much as possible. This embedding can then be plotted to 
visualize the clusters and their distances. There are various algorithms
to do this, and they are rather involved. Some try to match
the original distances in the embedding as closely as possible. Others try
to preserve the original ordering of the distances, i.e., the farther
apart two samples were originally, the farther apart their images
will be under the embedding.

LINEAR DISCRIMINANT ANALYSIS 

For a data set with predetermined groups, linear discriminant 
analysis (LDA) constructs a classification criterion which can be 
used for predicting group membership of new data. LDA finds 
linear combinations of variables that best separate the groups, and 
chooses hyperplanes perpendicular to these vectors to split the 
data into two groups. 

Let X be a data set with defined groups 1, . . ., n. For each group
j, there exists a corresponding conditional distribution described
in Eq. 5:

$$X^{(i)} \mid G^{(i)} = j \sim f_j \qquad (5)$$

Furthermore, let $\pi_j$ represent the proportion of X that is contained
in group j. To perform an LDA on X, we assume that each
$f_j$ is normally distributed with an equal covariance matrix $\Sigma$, but
with possibly different means $\mu_j$. Using maximum likelihood estimation
theory, the linear discriminant functions can be derived
in Eq. 6:

$$g_j(x) = \log(\pi_j) + x \Sigma^{-1} \mu_j^{T} - \tfrac{1}{2} \mu_j \Sigma^{-1} \mu_j^{T} \qquad (6)$$

Note that $\pi_j$, $\mu_j$, and $\Sigma$ are unknown parameters for our groups'
conditional distributions, so we estimate them using our sample
data X in an intuitive manner. Suppose X has N data points and
group j has $n_j$ points contained in it. Then we estimate $\pi_j$ by
$\hat{\pi}_j = n_j / N$, and $\mu_j$ by the group sample mean
$\hat{\mu}_j = \frac{1}{n_j} \sum_{i : G^{(i)} = j} x^{(i)}$. Let $S_j$ be the sample covariance
matrix for group j calculated from X. Each $\hat{\Sigma}_j$ is taken to be
the pooled covariance matrix of X. Consequently, $\hat{\Sigma}_j = \hat{\Sigma}_k$
for all $k \in \{1, \ldots, n\}$. Therefore, let $\hat{\Sigma} = \hat{\Sigma}_1 = \hat{\Sigma}_2 = \ldots = \hat{\Sigma}_n$.
With our population parameters estimated from our sample data
X, the linear discriminant functions from Eq. 6 become those described
in Eq. 7:

$$g_j(x) = \log(\hat{\pi}_j) + x \hat{\Sigma}^{-1} \hat{\mu}_j^{T} - \tfrac{1}{2} \hat{\mu}_j \hat{\Sigma}^{-1} \hat{\mu}_j^{T} \qquad (7)$$

Note that Eq. 7 is a linear function of x, since $\log(\hat{\pi}_j) - \tfrac{1}{2} \hat{\mu}_j \hat{\Sigma}^{-1} \hat{\mu}_j^{T}$
is a constant.
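The step from the normal model to a linear discriminant function can be written out briefly. The sketch below assumes the row-vector convention used for the Mahalanobis distance later in this section and a common covariance matrix $\Sigma$ for all groups; the constant $c$ is notation introduced here for terms that do not depend on the group.

```latex
\begin{align*}
\log\bigl(\pi_j f_j(x)\bigr)
  &= \log(\pi_j) - \tfrac{1}{2}\,(x - \mu_j)\,\Sigma^{-1}\,(x - \mu_j)^{T} + c \\
  &= \log(\pi_j) - \tfrac{1}{2}\,x\,\Sigma^{-1}x^{T}
     + x\,\Sigma^{-1}\mu_j^{T}
     - \tfrac{1}{2}\,\mu_j\,\Sigma^{-1}\mu_j^{T} + c
\end{align*}
```

The terms $-\tfrac{1}{2}\,x\,\Sigma^{-1}x^{T}$ and $c$ are the same for every group, so they do not affect which group attains the maximum. Dropping them leaves a function that is linear in $x$, which is why the equal covariance assumption produces linear decision boundaries.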

These $g_j$ from Eq. 7 are our classifying functions. Since for a
point x we sought to maximize $\pi_j f_j$, our classification criterion is:

assign x to group j if $g_j(x) > g_k(x)$ for all $k \neq j$.

With the classification criterion, decision boundaries between
groups can be found. The decision boundaries are where the
discriminant functions intersect. That is, the decision boundary
between groups j and k is $\{x : g_j(x) = g_k(x)\}$. Therefore, the linear
discriminant functions split the data space into regions. Each
region corresponds to a specific group and the decision boundaries 
separate the regions. 

In the original derivation of LDA (Fisher, 1936), the classifier did
not start with the multivariate normal distribution. Instead, Fisher
sought the linear combination of variables that maximized the
ratio of the separation of the class means to the within group
variance. The pooled covariance is used in his derivation, which
assumes the covariance of the groups is equal. Even though our
motivation and derivation are different, we still end up with Fisher's
coefficients (Venables and Ripley, 2002).

To judge how well a given LDA acts as a classifier for new data,
leave one out cross-validation can be used and is implemented
in the Statistical Package R (2009). Let X be a data set with
m data points, and with groups 1, . . ., n. For an LDA carried out on
X, leave one out cross-validation removes one observation, $x^{(i)}$, at
a time from X, performs an LDA on the reduced data set, and then
uses this new LDA to classify $x^{(i)}$. Since the group membership of
$x^{(i)}$ is already known, we can check if the quasi-LDA for X classifies
$x^{(i)}$ correctly or not. For every observation in X, the procedure
of leaving one out and classifying with a new LDA is performed.
The number p of misclassifications is found. The proportion
p/m is an estimate for the probability of the LDA carried out on X
misclassifying a new observation.
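The leave one out loop can be sketched as follows. To keep the example self-contained, a nearest group mean classifier is used as a stand-in for LDA; the two coincide only when the groups have equal prior proportions and an identity covariance matrix, so this is an assumption of the sketch and not a property of the method. The function names and data are hypothetical.

```python
import math


def fit_means(X, y):
    """Estimate one mean vector per group."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {g: tuple(sum(c) / len(rows) for c in zip(*rows)) for g, rows in groups.items()}


def classify(means, x):
    """Assign x to the group with the closest mean (stand-in for the LDA rule)."""
    return min(means, key=lambda g: math.dist(x, means[g]))


def loo_error(X, y):
    """Proportion p/m of observations misclassified when each is left out in turn."""
    p = 0
    for i in range(len(X)):
        means = fit_means(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        p += classify(means, X[i]) != y[i]
    return p / len(X)


X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["a", "a", "a", "b", "b", "b"]
print(loo_error(X, y))                       # 0.0 for well separated groups
print(loo_error(X + [(5.5, 5.5)], y + ["a"]))  # one observation labelled against its neighbours
```

In the second call only the added observation is misclassified, so the estimate is one in seven.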

PRINCIPAL COMPONENT ANALYSIS 

Principal component analysis is a dimension reduction technique. 
It uses orthonormal linear combinations of the variables of the 
data, called principal components, to capture most of the vari- 
ance in a few dimensions. The idea is to choose the first principal 
component so that it has maximal variance, and each successive 
principal component so that it absorbs as much of the remain- 
ing variance as possible. The number of principal components 
of a dataset is equal to the number of variables, but most of the 
variance is concentrated in the first few. 

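Given an n x q data matrix Y with corresponding q x q covariance
matrix S, the q x 1 principal component vectors $v_1, \ldots, v_q$ are
described in Eq. 8:

$$v_1 = \arg\max_{\lVert v \rVert = 1} v^{T} S v, \qquad v_j = \arg\max_{\lVert v \rVert = 1,\; v \perp v_1, \ldots, v_{j-1}} v^{T} S v \quad (j = 2, \ldots, q) \qquad (8)$$
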
Since S is a symmetric matrix, the spectral theorem shows that 
all of its eigenvalues are real and that it has an orthonormal basis 
of eigenvectors (Marden, 2011). Hence it follows that the principal 
components of Y are the eigenvectors of S ordered by decreasing 
eigenvalues. 

The principal components of Y capture all of the variance of 
the variables. PCA is an effective tool when the first few principal 
components account for most of the variance. In practice, being 
able to capture over 95% of the variance in the first two principal 
components is not unusual. Then the data can be plotted along the 
first two or three principal components to visualize clustering. If 
the first few principal components fail to account for most of the 
variance, it indicates that the data is inherently multidimensional. 
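The link between principal components and the eigenvectors of the covariance matrix can be illustrated with power iteration, which finds the eigenvector with the largest eigenvalue. This is a sketch for the first component only, using the standard library; the function names, iteration count, and toy data are assumptions, and a full analysis would normally rely on an established linear algebra routine.

```python
import math


def covariance(Y):
    """Sample covariance matrix (q x q) of an n x q data matrix given as rows."""
    n, q = len(Y), len(Y[0])
    mean = [sum(row[j] for row in Y) / n for j in range(q)]
    return [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in Y) / (n - 1)
             for b in range(q)] for a in range(q)]


def first_pc(S, iters=500):
    """Leading eigenvalue and unit eigenvector of S by power iteration.

    Assumes S is non-zero and the starting vector is not orthogonal to the
    leading eigenvector.
    """
    q = len(S)
    v = [1.0 / math.sqrt(q)] * q
    for _ in range(iters):
        w = [sum(S[a][b] * v[b] for b in range(q)) for a in range(q)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(S[a][b] * v[b] for b in range(q)) for a in range(q))
    return eigenvalue, v


# Toy data lying exactly on the line y = 2x, so one component holds all the variance.
Y = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
S = covariance(Y)
eigenvalue, v = first_pc(S)
explained = eigenvalue / (S[0][0] + S[1][1])  # share of total variance
print(v, explained)
```

The share of variance explained by a component is its eigenvalue divided by the sum of all eigenvalues, which equals the trace of the covariance matrix.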

CANONICAL DISCRIMINANT ANALYSIS 

Canonical discriminant analysis centers on the construction of 
canonical components to explain the variance between classes. 
For a data set with variables $(v_1, v_2, \ldots, v_k)$, these canonical
components are linear combinations of the form shown in
Eqs 9 and 10:

$$\mathrm{Can1} = a_1 v_1 + a_2 v_2 + \ldots + a_k v_k \qquad (9)$$
$$\mathrm{Can2} = b_1 v_1 + b_2 v_2 + \ldots + b_k v_k \qquad (10)$$
For two-dimensional visualization it is necessary to project the
variable vectors $v_i$ and $v_j$ onto the canonical component axes
Can1 and Can2 (Marden, 2011). The projections of the variables
maintain the relationship between their coefficient variables. That
is shown in Eq. 11:

a; a, a, a, 

— = — and — = — (11) 

bi b, bj bj 

The amount of the inter-class variance that is explained by each 
component is indicated in parentheses along each axis. The vec- 
tors can be rescaled to obtain the clearest visualization, but they 
must maintain the ratio of their lengths as this is proportional to 
their importance. Each sample is plotted according to its canonical 
scores. Let x be a sample, such that $x = (x_1, x_2, \ldots, x_k)$, from a data
set whose first canonical components are $C_1$ and $C_2$, such that the
coefficients of $C_1$ are $(a_1, a_2, \ldots, a_k)$ and those of $C_2$ are $(b_1, b_2, \ldots, b_k)$.
Then we compute using Eq. 12:

$$xC = x \begin{pmatrix} C_1 & C_2 \end{pmatrix} = (x_1, x_2, \ldots, x_k) \begin{pmatrix} a_1 & b_1 \\ a_2 & b_2 \\ \vdots & \vdots \\ a_k & b_k \end{pmatrix} = (C_1(x), C_2(x)) \qquad (12)$$

The canonical scores of a sample x are $(C_1(x), C_2(x))$, which
describe its position in the 2-dimensional space defined by the
first two canonical components. The mean scores and confidence
intervals of the means can also be plotted.
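Computing the canonical scores amounts to two dot products per sample. The sketch below assumes the coefficient vectors of the first two canonical components are already available; the function name and numbers are hypothetical.

```python
def canonical_scores(x, a, b):
    """Position of sample x on the first two canonical components.

    `a` and `b` hold the coefficients of the first and second component.
    """
    c1 = sum(xi * ai for xi, ai in zip(x, a))
    c2 = sum(xi * bi for xi, bi in zip(x, b))
    return (c1, c2)


# Hypothetical sample with three variables and made-up coefficients.
x = (1, 2, 3)
a = (1, 0, 0)
b = (0, 1, 1)
print(canonical_scores(x, a, b))  # (1, 5)
```

Plotting the pair returned for every metagenome gives the two-dimensional picture described above.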

The choice of group was determined by the minimal Mahalanobis
distance. The Mahalanobis measure is a scale-invariant
distance measure based on correlation. The distance of a multivariate
vector $x = (x_1, x_2, \ldots, x_k)$ from a group with mean $\mu = (\mu_1, \mu_2, \ldots, \mu_k)$
and covariance matrix S is defined in Eq. 13:

$$D_M(x) = \sqrt{(x - \mu) S^{-1} (x - \mu)^{T}} \qquad (13)$$

More intuitively, consider the ellipsoid that best represents the
group's probability density. The Mahalanobis distance is simply
the distance of the sample point from the center of mass, divided
by the spread (width of the ellipsoid) in the direction of the sample
vector (Marden, 2011).
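The effect of dividing by the spread can be seen in a small two-variable example. The sketch below inverts a 2 x 2 covariance matrix by hand, so it only applies to two variables; the function name and the numbers are hypothetical.

```python
import math


def mahalanobis_2d(x, mu, S):
    """Mahalanobis distance of a 2-variable point x from a group with mean mu and covariance S."""
    (a, b), (c, d) = S
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    inv = [[d / det, -b / det], [-c / det, a / det]]  # inverse of a 2 x 2 matrix
    dx = (x[0] - mu[0], x[1] - mu[1])
    q = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.sqrt(q)


# The group varies more along the first variable (variance 4) than the second (variance 1).
S = [[4.0, 0.0], [0.0, 1.0]]
mu = (0.0, 0.0)
print(mahalanobis_2d((2.0, 0.0), mu, S))  # 1.0: two units along the wide direction
print(mahalanobis_2d((0.0, 2.0), mu, S))  # 2.0: two units along the narrow direction
```

Both points are the same Euclidean distance from the mean, but the second is farther in Mahalanobis terms because the group is less spread out in that direction.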

RESULTS 
OVERVIEW 

We begin by assessing the clustering of the metagenomes and test 
whether the clusters chosen to reflect the environmental signals 
are statistically supported (K-means, decision trees, and random 
forests). We then move on to methods to explore and visualize the 
underlying structure of the data (MDS, linear discriminant, prin- 
cipal components, and CDA). An outline of the statistical methods 
tested is shown in Figure 2. Obviously statistical analysis is not a 
linear process, and many of the techniques were influenced by the 
results from previous (or subsequent) analyses. Although this dis- 
cussion attempts to maintain a linear structure for readability, that 
is not always possible or appropriate. Often the researcher will have 
a specific biological question and a single specific statistical analysis
will be appropriate. A combination of statistical tests can provide
better visualization of the data. For example, random forests are
good at recognizing important variables and how the observations 
are divided or classified, but do not provide data visualization 
tools. Therefore, we used a random forest analysis to provide the 
clustering and a MDS plot to visualize the data. 

K-MEANS CLUSTERING

The most straightforward method to cluster data is by grouping
into related sets. K-means clustering aims to classify observations
into K groups by partitioning observations into clusters in order
to minimize the sum of squared distances from each observation
to the mean of its assigned group. The K-means algorithm
starts by randomly selecting a specified number of means and
groups observations by assigning each one to the mean it is closest
to in distance. The group means are then recalculated using the
observations, replacing the previous means. The observations are
reassigned to a group based on the distance between the value and
the mean of the group. The algorithm iterates until the groups
stabilize. The algorithm will converge to a local minimum, but
not necessarily to a global minimum; it is therefore necessary to
initialize and run the analysis many times.
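
The iteration described above, including the repeated random initializations, can be sketched as follows. This is a minimal illustration in plain Python under our own assumptions (the function names, the choice of initial means as randomly drawn observations, and the default of 25 restarts are all hypothetical). It is not the implementation used for the results in this paper.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans_once(data, k, rng, max_iter=100):
    """One run from a random start; returns (sum of squares, labels, means)."""
    means = [list(p) for p in rng.sample(data, k)]  # random initial means
    labels = None
    for _ in range(max_iter):
        # Assign every observation to its closest mean.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, means[j]))
                      for p in data]
        if new_labels == labels:  # groups have stabilized
            break
        labels = new_labels
        # Recalculate each group mean from its current members.
        for j in range(k):
            members = [p for p, l in zip(data, labels) if l == j]
            if members:  # keep the old mean if a group becomes empty
                means[j] = mean(members)
    wss = sum(sq_dist(p, means[l]) for p, l in zip(data, labels))
    return wss, labels, means


def kmeans(data, k, n_starts=25, seed=0):
    """Keep the run with the smallest within-group sum of squares."""
    rng = random.Random(seed)
    runs = (kmeans_once(data, k, rng) for _ in range(n_starts))
    return min(runs, key=lambda run: run[0])


# Toy data: two well separated blobs.
data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
wss, labels, means = kmeans(data, k=2)
print(wss, labels)
```

Keeping the best of several starts is what guards against a poor local minimum. The label numbers themselves are arbitrary and may differ between runs.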

Varying the number of groups (K) will produce different results
from the K-means algorithm. The sum of squares of distances in
general decreases as K increases, because there are more groups
to which observations can be assigned. Selecting the K with the
smallest sum of squares will over-fit the data. In fact, when K is the
number of observations, each observation will form a group by
itself and the sum of squares will be 0, but this does not give
any useful information about the data. A plot of the sum of
squares versus values of K is useful for determining an optimal
value of K (Figure 3A). K is often selected where the plot has an
"elbow." However, with metagenomic data the plot often appeared
rounded (Figure 3A); therefore, we optimized using silhouettes
(Rousseeuw, 1987) instead. The silhouette of an observation is the
difference between its distance from the second closest of the K means
and its distance from the closest, divided by its distance from the
second closest mean. In the best possible case, the observation is close
to its own mean and not very close to the second best mean,
i.e., its silhouette is close to 1. The set of all silhouettes (one for
each observation) for K from 1 to 10 is shown in Figure 3B. For
each value of K we calculate the average silhouette width, and
use the K that maximizes the width of the silhouettes. We found a
maximum at K = 6, with another smaller local optimum at K = 10
(Figure 3C).
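
The silhouette as worded above (distances to the closest and second closest group means) can be sketched as follows. Note that this mean-based form is a simplification: the definition in Rousseeuw (1987) uses average distances to the members of each cluster. The function names are hypothetical.

```python
import math


def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def simplified_silhouette(point, means):
    """(d2 - d1) / d2, where d1 and d2 are the distances from the point
    to its closest and second closest group means."""
    d = sorted(dist(point, m) for m in means)
    if len(d) < 2 or d[1] == 0.0:
        return 0.0  # undefined for a single group; report 0 by convention
    return (d[1] - d[0]) / d[1]


def average_silhouette(data, means):
    """Average silhouette width over all observations."""
    return sum(simplified_silhouette(p, means) for p in data) / len(data)


means = [[0.0, 0.0], [3.0, 4.0]]
print(simplified_silhouette([0.0, 0.0], means))  # sits on its own mean
print(simplified_silhouette([1.5, 2.0], means))  # equidistant from both
```

To choose K, the average width would be computed from the means of a K-means run for each candidate K, and the K with the largest average retained.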

The K-means algorithm was most useful for identifying outliers,
which could be checked visually and removed as required.
Using K = 6 groups identified two broad categories, (1) the
aquatic group cluster and (2) the human, terrestrial animal
associated, and mat community cluster (Table 1), but the remaining
four groups were small and consisted of samples that were potential
outliers. The advantage of the K-means approach was that it
showed broad patterns in the metagenomic data. If the researcher
did not know how many groups were in the dataset, this analysis
would be a good place to start. The disadvantage
was that it does not provide any information about which metabolic
processes were driving the broad scale separations in the
metagenomes.

FIGURE 3 | (A) The sums of squares and K value used to identify
the number of groups that the samples should be split into. No clear
elbow was evident; therefore silhouette plots were used to examine
the data. (B) A silhouette plot showing how it creates metagenomic
groups in the data. The most favorable grouping number is where
the average silhouette width is nearest to one. (C) The variation of
average silhouette width with K. There is a peak at K = 6 and an
uptick at K = 10.



CLASSIFICATION TREES 

A supervised decision tree constructs a classification tree by
identifying variables and decision rules that best distinguish between
predefined classes. If the response variable is continuous,
instead of predefined classes, a regression tree can be
constructed which predicts the average value of the response
variable. Either of these trees is suitable for metagenomic data, but
since we were interested in separating the data by environment
we used classification trees. Trees are invariant under monotonic
transformations of the explanatory variables, because constructing a
tree uses binary partitions of the data and thus most variable
scaling is unnecessary (De'ath and Fabricius, 2000; De'ath, 2002). This
feature is particularly important, because a mixture of data can be
included in the analysis, e.g., the percent of sequences similar to
a metabolic process or the pH where the metagenome was
collected. Combining genomic and environmental data will be useful
in future analyses.

Table 1 | The samples present in each of the clusters identified by the
K-means analysis with K = 6. This was chosen because the silhouette
analysis suggested that six clusters were the most appropriate (Figure 3).
There were 33 human, 9 terrestrial animal, 10 mat community, 42 open
water, 20 reef water, 60 coastal water, 5 deep water, 7 fresh water, 15
hypersaline, and 6 hot spring samples in total.

Number of metagenomes | Original metagenome classification
52 | 31 Human; 5 Terrestrial animals; 6 Mat community; Water samples: 4 Open marine, 3 Coral reef, 2 Coastal marine, 1 Fresh
1 | 1 Coral reef water sample
1 | 1 Coral reef water sample
3 | 1 Human; 1 Fresh water; 1 Coral reef water
149 | 4 Mat; 4 Terrestrial animals; 1 Human; Water samples: 56 Coastal marine, 5 Deep marine, 15 Solar evaporation ponds, 6 Hydrothermal spring, 38 Open marine, 13 Coral reef, 7 Freshwater
6 | Water samples: 2 Coastal marine, 3 Freshwater, 1 Coral reef

The construction of a supervised tree minimizes the mixing
of the different predefined classes within a leaf (called the node
impurity). At each branching point, the algorithm chooses a single
variable and a value that splits the node, minimizing the impurity.
(There are several ways to measure impurity, as described in
the methods.) In general, trees are a balance between classification
strength and model complexity, with the goal of maximizing
prediction strength and minimizing over-fitting. Often a large tree
is grown that over-fits the data, and pruning and cross-validation
are used to select the most appropriate sub-tree of that original
tree (Breiman et al., 1984).
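
To make the branching step concrete, the sketch below chooses the single variable and threshold that minimize the weighted impurity of the two child nodes. It assumes the Gini index as the impurity measure, which is only one of the possible measures, and the function names and toy abundances are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (weighted impurity, variable index, threshold) of the
    binary split that mixes the predefined classes the least."""
    n = len(rows)
    best = None
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        # Candidate thresholds lie midway between adjacent observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[j] < t]
            right = [y for r, y in zip(rows, labels) if r[j] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, t)
    return best


# Toy data: variable 0 separates the two classes, variable 1 does not.
rows = [[0.001, 0.5], [0.002, 0.1], [0.010, 0.4], [0.012, 0.2]]
labels = ["free", "free", "host", "host"]
print(best_split(rows, labels))
```

A full tree would apply this step recursively to each child node and then be pruned by cross-validation, as described above.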

Unlike K-means clustering, decision tree classification provides
information about the variables that drive the separation. The best
classification tree using all the variables was determined by 500
runs of 10-fold cross-validation (Table 2). The cross-validation
identified three trees that gave similarly low values: the 6-, 8-, and
9-leafed trees. These were visually inspected to see which tree gave
information without being over-fitted, and this was the 9-leafed
tree. This classification tree (Figure 4) demonstrated that phage
proteins separated the host-associated microbial communities from
the majority of free-living communities. In particular, and as has
been shown before (Oliver et al., 2009; Reyes et al., 2010), the
host-associated communities and some microbial communities from
the fresh water and hypersaline environments characteristically
had more phage proteins.



Table 2 | Tree size and average deviance from a series of tree
cross-validation experiments.

Tree size | Average CV deviance
1 | 152.014
2 | 122.432
3 | 102.636
4 | 99.642
6 | 92.762
8 | 92.970
9 | 92.812
14 | 95.848
16 | 98.342
17 | 98.622



[Figure 4 graphic: classification tree. Legible split labels: Photosynthesis < 0.0038; Phages < 0.0023; Cofactors < 0.081; Nitrogen (threshold illegible).]

FIGURE 4 | A classification tree showing the separation of
metagenomes from different environments based on the abundance
of the subsystems in each environment. The abundances are normalized
as described in the methods. The tree has been pruned to only show the
eight most important variables.



Harsh environments (such as hypersaline aquatic environments)
had more cofactors, vitamins, and pigments. Within the
marine realm, the coastal and deep water samples had, as expected,
fewer photosynthetic proteins than the open water samples, but
the photosynthetic potential of the reefs was mixed (Dinsdale
et al., 2008a). Photosynthetic potential also aided the identification
of stratification in the mat microbial communities by depth,
a separation that was supported by metabolism that occurs in
microaerobic or anoxic conditions. The major advantages of
classification trees are the ability to use any continuous variable type,
fast calculation time, good visualization, and the ability to calculate
a misclassification rate. The use of classification trees in association
with environmental data in the future will be able to show the
interactions between the environmental and genomic characteristics.
The disadvantages are the tendency to over-fit the trees and
the lack of stability: small changes in the data, such as adding one
more sample, can yield dramatically different results.

FIGURE 5 | Variable importance determined by the random forest
analysis using mean decrease in (A) accuracy and (B) Gini.



RANDOM FORESTS 

The random forest (Breiman, 2001) technique aims to overcome
the limitations of the classification tree by generating a large
ensemble of trees, each from a random subset of the data and a
random selection of the variables. The resulting ensemble of trees
(the random forest) is then used with a majority-voting approach
to decide which metagenomes belong to which groups. The
computation is not excessive: a random forest with 1000 trees trained
on 212 metagenome datasets was computed in a few seconds. The
speed of calculation and bootstrapping nature of random forests
may pave the way for calculations across all proteins in all
environments, thus reducing the amount of grouping conducted on
the data. The random forest is typically used to classify the data
into predefined groups (a supervised random forest). A subset of
the data and variables is used to generate the trees, and thus the
approach can predict the environment to which a metagenome
belongs. The random forest does not produce branching rules like
a single classification tree because the trees in the random forest
all differ from one another. Instead, the most parsimonious tree
is calculated using bagging (Table A1 in Appendix). In addition
to bagging, the random forest generates a measure of the importance
of each variable, calculated by either the mean decrease in accuracy
or the mean decrease in the Gini index (Figure 5). These two values
indicate which variables contributed the most to generating strong
trees and can be used in other visualization analyses such as MDS or
CDA as described below.
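
The ensemble idea can be illustrated with a deliberately reduced sketch: each "tree" below is a single split (a decision stump) grown on a bootstrap sample using a random subset of the variables, and predictions are made by majority vote. Real random forests grow full trees and draw a fresh random subset of variables at every split, so this is a simplification under stated assumptions, with hypothetical names and toy data throughout.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_stump(rows, labels, features):
    """Best single split over the given variables, by Gini impurity.
    Returns (variable, threshold, left label, right label)."""
    majority = Counter(labels).most_common(1)[0][0]
    best = (gini(labels), None, None, majority, majority)
    for j in features:
        values = sorted(set(r[j] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[j] < t]
            right = [y for r, y in zip(rows, labels) if r[j] >= t]
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(rows)
            if score < best[0]:
                best = (score, j, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    return best[1:]


def predict_stump(stump, x):
    j, t, left_label, right_label = stump
    if j is None:  # no useful split was found in this bootstrap sample
        return left_label
    return left_label if x[j] < t else right_label


def fit_forest(rows, labels, n_trees=101, n_features=1, seed=0):
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feats = rng.sample(range(p), n_features)     # random variables
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest


def predict_forest(forest, x):
    """Majority vote over all trees in the ensemble."""
    votes = Counter(predict_stump(s, x) for s in forest)
    return votes.most_common(1)[0][0]


rows = [[0.1, 5.0], [0.2, 5.5], [0.3, 6.0], [0.4, 5.2],
        [0.9, 1.0], [1.0, 1.5], [1.1, 1.2], [0.8, 1.1]]
labels = ["host"] * 4 + ["free"] * 4
forest = fit_forest(rows, labels)
print(predict_forest(forest, [0.25, 5.4]))
```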

In an unsupervised random forest, the metagenomic data is
classified without a priori class specifications. Therefore,
unsupervised random forests remove researcher bias. Synthetic classes
are generated randomly and the forest of trees is grown. Similar
metagenomes will end up in the same leaves of trees due to the
tree branching process, and the proximity of two metagenomes is
measured by the number of times they appear on the same leaf.
The proximity is normalized so that a metagenome has a proximity
of one with itself, and 1 - proximity is a dissimilarity measure
(Shi and Horvath, 2006). The strength of the clustering detected
this way may be measured by a "partitioning around medoids"
(PAM) analysis (Marden, 2011). Conceptually similar to K-means
clustering, PAM picks K metagenomes called medoids, and
creates clusters by assigning each metagenome to the group
represented by its closest medoid. The algorithm looks for whichever
K metagenomes minimize the sum of the distances between all
metagenomes and their assigned medoids.
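
The proximity measure and the medoid search can be sketched as follows. The input format (a leaf identifier for every sample in every tree) and the function names are assumptions. The medoid search here is exhaustive over all sets of K samples, which is feasible only for very small examples; practical PAM implementations use a build-and-swap heuristic instead.

```python
from itertools import combinations


def proximity(leaves):
    """leaves[t][i] is the leaf reached by sample i in tree t.
    Returns the fraction of trees in which two samples share a leaf,
    so every sample has proximity 1 with itself."""
    n_trees, n = len(leaves), len(leaves[0])
    return [[sum(tree[i] == tree[j] for tree in leaves) / n_trees
             for j in range(n)] for i in range(n)]


def pam_exhaustive(diss, k):
    """Pick the k medoids minimizing the summed dissimilarity between
    every sample and its closest medoid (brute force)."""
    n = len(diss)
    best = None
    for medoids in combinations(range(n), k):
        cost = sum(min(diss[i][m] for m in medoids) for i in range(n))
        if best is None or cost < best[0]:
            best = (cost, medoids)
    cost, medoids = best
    labels = [min(medoids, key=lambda m: diss[i][m]) for i in range(n)]
    return medoids, labels, cost


# Hypothetical leaf assignments for 4 samples in 3 trees.
leaves = [[0, 0, 1, 1],
          [0, 0, 0, 1],
          [2, 2, 3, 3]]
prox = proximity(leaves)
diss = [[1.0 - p for p in row] for row in prox]  # 1 - proximity
print(pam_exhaustive(diss, k=2))
```

The resulting dissimilarity matrix is also the natural input for the MDS plots.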

Overall, the photosynthesis and phage groups were the most
important variables in separating the data sets, and in
the mean decrease in accuracy plot a break occurred between
these two variables and the remaining variables, suggesting that
just these two measures could be used to grossly classify the
metagenomes (Figure 5). The next break appeared after the eighth
variable. These eight variables were thus chosen for the CDA
analysis described below. The misclassification rate of the
random forest analysis was 31% (Table 3), and these misclassifications
occurred because metagenomes from the various marine
environments were mixed. The marine environment categories of open
ocean, coastal waters, coral reef, and deep ocean share many
metabolic features, and therefore these metagenomes were placed into
categories different from their a priori group assignment. This
suggests subtle variation in the metabolic processes occurring
in the microbial communities from each environment that should
be investigated in the future.

The advantages of the random forest are that it is a rapid clas- 
sification technique that is less susceptible to over-fitting data and 
can be run in a bootstrap fashion. In addition, the random forest 
provides a measure of the importance of each variable that can be 
used in other analyses. These advantages of random forests mean 
that the metagenomes could be analyzed on the gene level, rather 
than the higher subsystem level. The disadvantage is that because 
each forest is an ensemble of trees, identifying individual
classification decisions is not possible, which is why we plotted the data
using an MDS.

Table 3 | The group that each metagenome was assigned to by the random forest analysis.

MULTIDIMENSIONAL SCALING

Multidimensional scaling is a visualization tool that directly scales 
objects based on either similarity or dissimilarity matrices (Quinn 
and Keough, 2002). MDS projects the proximity measures of 
the metagenomes as determined by other techniques to a lower- 
dimensional space (e.g., 2-dimensional space for plotting on xy- 
axis). For the random forests, the similarity was measured as the 
number of times two metagenomes appeared on the same leaf in 
the trees (proximity), and is represented by the distance between 
two samples on the MDS plot. The MDS plots are colored either 
by the five PAM groupings from the random forest (Figure 6A), 
or the 10 predefined environments (Figure 6B). In this analysis, 
the visualization highlights the separation of the microbes from 
human/animal hosts from other samples along the first dimension 
and the separation of the aquatic and mat communities along the 
second dimension. 

It is important to note that MDS is a visualization tech- 
nique that takes its input from other classification or clustering 
approaches. MDS is useful for showing which metagenomes have 
similar features, because metagenomes that are positioned closer 
together will be more similar to each other than those farther apart 
on the plot. 

LINEAR DISCRIMINANT ANALYSIS 

Linear discriminant analysis is a supervised statistical technique 
that aims to separate the data into groups based on hyperplanes 
and describe the differences between groups by a linear clas- 
sification criterion that identifies decision boundaries between 
groups. 

The LDA over all 27 metabolic variables separated the data
(Figure 7) and showed that the human and terrestrial animal
associated metagenomes separated from a cluster consisting of
all of the aquatic samples except the hypersaline community. The
mat samples separated distinctly from the other clusters. A
leave-one-out cross-validation showed that the LDA misclassified 36%
of the samples. Most of the misclassified samples were from the
aquatic metagenomes that are difficult to separate (as discussed
below). Even though it is likely that the data does not meet all
the requirements for an LDA, including the assumption of equal
population group covariance, a linear function of the variables is
still able to separate the groups. We derive the linear discriminant
functions assuming the data is normally distributed for simplicity,
but this is not necessary. The advantages of LDA are the ability
to both visualize the data and obtain a statistically robust
classification, but the disadvantages include the assumption of equal
population covariance.
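
The leave-one-out procedure behind a misclassification rate such as the one quoted above can be sketched as follows. To keep the example short and dependency-free, the classifier is a nearest-group-mean rule, used here as a stand-in for LDA (it ignores the pooled covariance), and all names and data are hypothetical.

```python
def nearest_mean_classifier(train_rows, train_labels):
    """Stand-in classifier: assign a sample to the group whose mean is
    closest in Euclidean distance. Not an LDA."""
    groups = {}
    for r, y in zip(train_rows, train_labels):
        groups.setdefault(y, []).append(r)
    means = {y: [sum(c) / len(rs) for c in zip(*rs)]
             for y, rs in groups.items()}

    def predict(x):
        return min(means, key=lambda y: sum((a - b) ** 2
                                            for a, b in zip(x, means[y])))
    return predict


def loo_misclassification(rows, labels, fit=nearest_mean_classifier):
    """Hold out each sample in turn, refit on the rest, and report the
    fraction of held-out samples assigned to the wrong group."""
    errors = 0
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        predict = fit(train_rows, train_labels)
        errors += predict(rows[i]) != labels[i]
    return errors / len(rows)


# Toy data: the last sample carries label "a" but lies among the "b" samples.
rows = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [5.5, 5.5]]
labels = ["a", "a", "a", "b", "b", "b", "a"]
print(loo_misclassification(rows, labels))
```

Any classifier with the same fit-then-predict shape, including a proper LDA, could be passed in through the `fit` argument.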

PRINCIPAL COMPONENT ANALYSIS 

Principal component analysis is one of the most widely used sta- 
tistical analyses for genomics data because it is a straightforward, 
robust data reduction technique that is trivial to apply to large data 
sets. PCA selects linear combinations of the variables sorted so that 
each combination accounts for as much of the sample variance as 
possible, while being orthogonal to the previous combinations. 
These combinations of the variables are called the principal com- 
ponents. The goal of PCA is to explain as much of the variance as 
possible in the first few components, and thus reduce the complex- 
ity of the data by combining related variables. We began with the 
eight most important variables identified by the random forest, 
and used PCA to reduce these to a two-dimensional plot. Figure 8 
shows a PCA plot of the first two principal components of the data 
set, and shows the directionality of the importance of each variable. 
The data was positioned on a plane which was influenced by a
high percent of sequences associated with DNA metabolism, cell
division, and amino acid metabolism in one direction, and virulence
and RNA metabolism in the other, with cofactor metabolism
important in both directions. The metagenomes did not separate
particularly well with this analysis; however, human and terrestrial
animal associated samples clustered above aquatic samples. The
first two dimensions of the PCA did not provide good resolution
of the nuances within an environment, explaining only 38% of the
variance. This suggests that a large number of components were
needed to explain the variance in our data and highlights a problem
with PCA: it is not able to reduce the complexity of the data
if the variables are not correlated. The lack of correlation in the
variables can be seen in Figure 8, because the metabolic processes
are facing in different directions around the graph. There is no
grouping of any of the eight metabolic processes shown. We did get
better resolution with PCA on certain subsets of the data, for
example, using some of the organism-associated metagenomes. In this
case the first two principal components accounted for 79% of the
variance. We did not include those graphs in this paper.

FIGURE 6 | Multidimensional scaling plots of the distances calculated
from the unsupervised random forest. The distances are based on the
number of times the samples appear on the same leaf of a tree, and the
MDS projects those distances into two dimensions.
Colored by (A) the five PAM groupings suggested by the random forest
(see text); or (B) the original environments the samples came from.

FIGURE 7 | Linear discriminant analysis showing the position of the
metagenomes in two-dimensional space from the 10 environments.

FIGURE 8 | Principal component analysis of the 212 metagenomes
using the top eight variables identified from the random forest
analysis. The samples are colored and shaped by the environment where
they came from. The samples are largely aligned on a 45° plane from
virulence-DNA metabolism to amino acids-cofactors.

FIGURE 9 | Canonical discriminant analysis of the 212 metagenomes
using the top eight variables identified from the random forest
analysis. The plot shows the separation in the host-associated microbial
communities and the free-living communities. The analysis explained 91%
of the variance, suggesting that metagenomes can be discriminated by the
metabolic potential. Lines depict the h-plot of important metabolic
processes and the points are the centroid or mean for the 10 environments.

The advantages of the PCA are that it reduces the complex- 
ity of the data, especially if many of the variables are correlated, 
and it provides a mechanism for visualizing higher-dimensionality 
data. The disadvantages of the PCA are that it does not classify the 
metagenomes into groups and if the variables are not correlated it 
is unable to reduce the dimensionality of the data. 

CANONICAL DISCRIMINANT ANALYSIS 

Canonical Discriminant Analysis is another approach to reduce 
the dimensionality of the data, similar to PCA and LDA. However, 
in addition to visualizing the data, CDA can be used to classify 
the data into pre-assigned groups. Like the PCA, CDA searches for 
linear combinations of variables that explain the data. Like a super- 
vised random forest, CDA can be used to explore the variables 
responsible for differentiating between groups. 

CDA identifies variation between groups by identifying the
linear combination of variables that has the maximum multiple
correlation with the groups. The second component is the linear
combination that has the highest possible multiple correlation
with the groups and is uncorrelated with the first component.
The process is repeated using all the data, and providing one fewer
components than variables. A fundamental difference between
PCA and CDA is the covariance matrix: in the former the covariance
matrix displays the variance between individual samples,
while in the latter it displays the variance between groups. As
with the PCA, we explored the effect of the eight most important
variables on the separation of the 212 metagenomes
using CDA (Figure 9) and found the centroids of the groups
and vectors that demonstrate the directionality of the importance
of each variable. The length of the vector in the plot is
proportional to the importance of that variable in separating
the data.

The CDA showed that the host-associated microbial commu- 
nities were separated from the other environments by the abun- 
dance of sequences similar to phage and dormancy proteins. The 
harsh hydrothermal springs were again associated with the need 
for cofactors. The photosynthetic potential separated the coastal 
and open water metagenomes. Membrane transport, protein and 
nitrogen metabolism were also important in separating the aquatic 
from host-associated metagenomes. The analysis explained a large 
amount of the variance (91%) showing the importance of a key 
set of metabolic processes in each environment. However, the mis- 
classification rate of the CDA was 39.7%. Once again the largest 
misclassification occurred between the metagenomes collected 
from the four marine environments (Table 4). 

The advantages of the CDA are that it combines the dimensionality
reduction of the PCA with the classification of the random
forest or K-means approaches. The disadvantages of the CDA
are that the metagenomes are placed into predefined groups and
thus are subject to observer bias, and CDA is prone to over-fitting
because the canonical components are linear combinations that
best separate the groups.

DISCUSSION 

Metagenomic data provides a wealth of information about the
functional potential of microbial communities, but the vastness
of the data makes it difficult to discern patterns and important
discriminators. A range of clustering, classification, and visualization
techniques were applied to analyze metagenomic data, and
demonstrated the ability of the metabolic profiles to describe
the difference between environments. The results show that a
mixture of methods provides an effective analysis of the data:
K-means was used to identify outliers, random forests to identify
the most important variables, and either a classification tree
or CDA to test the relevance of the environment to genomic
content.

The data generation processes could cause differences in the 
classification or separation of the data. However, the samples came
from multiple sources, each of which employed a range of iso- 
lation, purification, and sequencing techniques. There was no 
evidence of clustering of samples prepared or sequenced in a spe- 
cific manner, suggesting that the sampling technique per se is not 
driving the separation of the data. 

The analyses separated the microbial samples into three broad
groups (based on the environments from where they were isolated):
the human and animal associated samples, the microbial
mats, and the aquatic samples. There was a clear difference between
environments. For example, human associated and aquatic samples
were clearly separated by all of the techniques. However,
samples from a similar environment were often misclassified. For
example, the coastal and open water metagenomes were difficult
to classify. More sampling and more thorough description of the
environmental parameters will clarify the classification of these
samples.

Table 4 | The misclassification table generated by the canonical discriminant analysis.

[Column headings and the rows preceding "Solar evaporative ponds" are illegible.]
Solar evaporative ponds | 1.231 | 0.000 | 0.000 | 0.000 | 1.485 | 0.000 | 0.283 | 0.000 | 0.000 | 0.000 | 0.504
Mat community | 0.382 | 0.000 | 0.000 | 0.004 | 0.000 | 1.613 | 0.000 | 0.000 | 0.000 | 0.000 | 0.193
Open marine water | 4.377 | 0.009 | 0.033 | 0.448 | 0.169 | 0.349 | 2.410 | 0.169 | 0.014 | 0.018 | 0.698
Coral reef | 1.509 | 0.009 | 0.283 | 0.429 | 0.000 | 0.226 | 1.117 | 0.235 | 0.023 | 0.377 | 0.994
Hydrothermal spring | 0.047 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.113 | 0.004 | 0.834 | 0.000 | 0.165
Terrestrial animal | 0.287 | 0.000 | 0.108 | 1.193 | 0.000 | 0.216 | 0.000 | 0.000 | 0.000 | 0.193 | 0.903

The combination of random forests and CDA demonstrated
that phage activity is a major separator of host-associated microbial
communities and free-living or environmental microbial communities,
suggesting that the phages are playing different ecological
roles within each environment. In free-living microbial communities,
phages are major predators and generally show similar
diversity to their hosts. In host-associated microbial communities,
phages are more diverse, suggesting that they may provide specific
genes to increase host survival (Reyes et al., 2010). The mat
communities separated from both the animal associated metagenomes
and the aquatic samples by the vitamin and cofactor metabolism,
suggesting a role for secondary metabolism associated with growth
in extreme environments. The dominant metabolism that separated
the aquatic samples was photosynthesis. Not surprisingly,
samples from deep in the ocean, and some of the impacted reef
sites, do not have many photosynthetic genes, while photosynthetic
genes abound on unaffected reefs and in surface waters of the open
ocean. Although only the one or two most abundant phenotypes
in each sample were described here, the statistical analysis reveals
less obvious separations among the data, and unraveling the role
of microbes in the global geobiology is an important goal for
post-metagenomic studies.

In summary, we hope that the statistical tools described here
will help microbial ecologists broaden the range of statistical
tools that are applied to metagenomic data and help them parse
out the important and interesting nuances that separate different
environments.
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APPENDIX 

Table A1 | Metagenomes used in the analysis. 
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Polynesia 


Global ocean samp 


ng 


92,501 


94,424,378 


Coasta 


water 


4441613 


GS117a 
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124,435 


133,251,132 



(Continued) 
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Table A1 | Continued 



Environment 


Genome ID 


Genome name 
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Plymouth 


qq aqq 


99 CPi^ -1Q7 
ZZ, Oo4, 1 o / 


Coasta 


water 


4443711 


SRS000536_2 




Marine synechococcus 
experiment 


333,462 


34,334,174 


Coasta 


water 


4443712 


mb2000jd298_2 




Monterey bay microbial 
study 


194,144 


46,983,239 


Coasta 


water 


4443713 


mb2000jd298_1 




Monterey bay microbial 
study 


217,549 


51,966,974 
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Environ ment 


Genome ID 


Genome name 


Project 


Num. of 
sequences 


Total 

length (bp) 


Coasta water 


AAA r XT\ A 
444o 1 1 4 


mh9Dni iH11R 1 
IT lUZUU I JU I I 0_ I 


Monterey bay microbial 
study 


1 PR 179 
I oO, I / Z 


A A 1 PQ R10. 

44, i oy,o I u 


^UdbLdl VVdLWI 


AAA P7 1 R 
444o / I O 


mh?nniiH11R 9 
II I UZUU I JU I I o z 


IVIUllLclcy Udy l[IIL.IUUIdl 

study 


173 161 


ao Rpn 7n 

4U,DOU, / I O 


^UdbLdl VVdLWI 


AAA"~K~I 1 P 


onouuuzoo 


OdptilU Ibldl IU 

metagenome 


49 524 


A 71 Q R90 
4, / i y, ozu 


^UdbLdl VVdLWf 


4440 1 1 Z) 


CRCnnn?9q 
onouuuzoy 


odptilU Ibldl IU 

metagenome 


46 421 


a pri mn 

4,oO I ,UoU 


^UdbLdl VVdLUl 


AAA119C) 


onouuuz'fu 


OdptilU Ibid! IU 

metagenome 


44 317 


A 90.Q 1 RP 
4,zuy, I Do 


^UdbLdl VVdLUf 


AA/LP791 
44tJ / Z 1 


jnouuuz'fz 


OdptilU Ibldl IU 

metagenome 


Q QR7 
y,yo 1 


yoo,4 / u 


^UdbLdl VVdLUI 


AAA^lOl 


cRcnnn?Ai 

onouuuz'f I 


odpeiu ISIdl IU 

metagenome 


A1 RP7 
4 I ,Oo / 


q pan np9 
o,oyu,uoz 


Coastal water | 4443724 | SRS000243 | Sapelo island metagenome | 30,673 | 2,940,585
Deep water | 4441025 | Mediterranean Bathypelagic Habitat | Mediterranean bathypelagic habitat | 9,047 | 7,202,361
Deep water | 4441041 | HOT/ALOHA - Below Base of Euphotic Zone 200m | Hot/aloha | 8,276 | 7,829,627
Deep water | 4441056 | HOT/ALOHA - Deep Abyss 4000m | Hot/aloha | 11,223 | 11,028,802
Deep water | 4441057 | HOT/ALOHA - Well Below Upper Mesopelagic 500m | Hot/aloha | 9,017 | 8,764,614


Deep water 


aaa-\ nco 
444 lUoz 


HOT/ALOHA - Core of Dissolved Oxygen Minimum Layer 770m


Hot/aloha


1 1 A ~7Q 

I 1 ,4/o 


11 Q11 COR 

1 1 ,o 1 1 ,oyb 


Deep water 


AAA"] RQfl 

444 i oyu 


GS020 - Fresh Water - Panama Canal - Lake Gatun - Panama


Global ocean sampling


9QK QRR 


o lo, lo i , ioy 


Freshwater 


AAA QC7Q 
444oD / 3 


AntarcticaAquatic_3 


Antarctica aquatic 
microbial 


1 n n/i o 

I U,U4Z 


Q 7RR Q 1 R 
y, / OD,0 I D 


Freshwater 


444oOoU 


AntarcticaAquatic_2 


Antarctica aquatic 
microbial 


Q R79 


Q R99 9Q1 

y,ozz,zo I 


Freshwater 


AAA QRP1 
444JOO I 


AntarcticaAquatic_4 


Antarctica aquatic 
microbial 


R/1 AAP. 
D4,44D 


R/1 Q9Q 7CQ 

o4,yzy, /oy 


Freshwater 


/I A A QCDQ 
444JDOJ 


AntarcticaAquatic_1 


Antarctica aquatic 
microbial 


1 nn noc 
I UU,Uoo 


IU I ,o IU,4/0 


Freshwater 


444oOo4 


AntarcticaAquatic_6 


Antarctica aquatic 
microbial 


noi /ion 
zo 1 ,4yU 


zo i ,uoo,oy i 


Freshwater | 4443685 | AntarcticaAquatic_7 | Antarctica aquatic microbial | 28,481 | 28,413,296
Freshwater | 4443687 | AntarcticaAquatic_9 | Antarctica aquatic microbial | 95,521 | 95,664,001
Freshwater | 4440411 | PrePondKentSTMic20060504 | Freshwater from aquaculture facility | 44,094 | 4,428,989
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Environment Genome ID Genome name 



Freshwater 4440413 TilPondKentSTMic20060504 



Freshwater 

Freshwater 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 



4440422 TilPondKentSTMic200608 



4440440 TilPondKentSTMic200511 



4441092 Australian Phosphorus Removing (EBPR) Sludge 



4441093 US Phosphorus Removing (EBPR) Sludge 



4440453 TS1 



4440454 TS2 



4440461 TS4 



4440462 TS5 



4440463 TS6 



4440595 TS3 



4440610 TS19 



4440611 TS20 



4440613 TS28 



4440614 TS49 



4440615 TS50 



4440616 TS29 



4440639 TS21 



4440640 TS51 



4440823 TS7 



4440824 TS8 



Project 

Freshwater from 
aquaculture facility 

Freshwater from 
aquaculture facility 

Freshwater from 
aquaculture facility 

Phosphorus removing 
(ebpr) sludge 

Phosphorus removing 
(ebpr) sludge 

Gut microbiome 
Twin study 
Twin study 
Twin study 
Twin study 
Twin study 
Twin study 
Twin study 
Twin study 
Twin study 
Twin study 
Twin study 
Twin study 
Twin study 
Twin study 
Twin study 



Num. of Total 
sequences length (bp) 



63,978 



67,612 



96,563 



6,484,135 



6,932,903 



381,076 38,804,235 



100,273,005 



127,953 120,938,054 



217,386 51,708,794 



443,640 78,853,892 



414,754 95,003,113 



490,776 100,599,979 



535,763 118,207,161 



510,972 102,717,417 



498,880 82,117,565 



495,040 98,053,098 



302,780 101,434,082 



519,072 91,987,878 



549,700 111,999,603 



502,399 173,386,030 



413,772 88,786,017 



434,187 81,330,211 



555,853 134,889,015 



414,497 100,520,072 
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Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Human 
associated 

Solar 

evaporation 
ponds 

Solar 

evaporation 
ponds 

Solar 

evaporation 
ponds 

Solar 

evaporation 
ponds 



4440825 TS30 



4440826 TS9 



4440939 human F1-S 



4440940 



4440941 



4440942 



4440943 



4440945 



4440947 



4440950 



4440951 



human F1-U 



human F1-T 



human F2-V 



human F2-W 



4440944 human F2-X 



human In-B 



4440946 human In-A 



human F2-Y 



4440948 human In-D 



4440949 human In-M 



human In-E 



human In-R 



4441050 Marine NaCI-Saturated Brine 



4441599 GS033 - Hypersaline Floreana Island - Ecuador 



4440324 LowSalternSDbayMic20051110 



4440329 SaltonSeaMic20060823 



Project 

Twin study 
Twin study 
Human 

feces - kurokawa 
Human 

feces - kurokawa 
Human 

feces - kurokawa 
Human 

feces - kurokawa 
Human 

feces - kurokawa 
Human 

feces - kurokawa 
Human 

feces - kurokawa 
Human 

feces - kurokawa 
Human 

feces - kurokawa 
Human 

feces - kurokawa 
Human 

feces - kurokawa 
Human 

feces - kurokawa 



Num. of Total 
sequences length (bp) 



Human 
feces - 

Marine 
brine 



kurokawa 
nacl-saturated 



495,865 

499,499 

28,900 

16,539 

36,326 

36,455 

30,198 

31,237 

9,958 

20,226 

35,177 

37,296 

16,164 

20,532 

34,797 

2,947 



Solar saltern 



Solar saltern 



49,074 



178,407 



94,405,318 

124,768,172 

38,010,851 

24,369,492 

43,259,070 

45,906,118 

40,076,128 

39,071,077 

14,499,070 

29,296,224 

45,480,292 

46,397,089

25,941,797 

27,208,886 

43,473,860 

2,380,900 



Global ocean sampling 692,255 729,708,089 



4,632,200 



18,876,339 


Solar evaporation ponds | 4440416 | MedSalterSDbayMic20051128 | Solar saltern | 8,062 | 705,995
Solar evaporation ponds | 4440419 | HighSalternSDbayMic20051128 | Solar saltern | 35,446 | 3,711,295
Solar evaporation ponds | 4440425 | MedSalternSDbayMic20051116 | Solar saltern | 120,987 | 11,867,028
Solar evaporation ponds | 4440426 | LowSalternSDbayMic20051128 | Solar saltern | 34,296 | 3,453,306
Solar evaporation ponds | 4440429 | HighSalternSDbayMicB200407 | Solar saltern | 39,553 | 4,028,912
Solar evaporation ponds | 4440430 | HighSalternSDbayMicA200407 | Solar saltern | 78,524 | 7,982,909
Solar evaporation ponds | 4440433 | HighSalternSDbayMicC200407 | Solar saltern | 123,879 | 12,641,571
Solar evaporation ponds | 4440434 | MedSalternSDbayMic20051111 | Solar saltern | 23,261 | 2,323,241
Solar evaporation ponds | 4440435 | MedSalternSDbayMic20051110 | Solar saltern | 38,929 | 3,905,955


Solar evaporation ponds | 4440437 | LowSalternSDbayMic200407 | Solar saltern | 268,206 | 25,280,522
Solar evaporation ponds | 4440438 | HighSalternSDbayMicD200407 | Solar saltern | 340,725 | 34,806,789


Mat

community 




i orrarn Mortm 1 O mm 
OUcllfcflU INcyiU I — Zllllll 


nypeibdiiMc yuwiiu 
negro 


I I ,ODZ 


7ZLRQ ?7R 


Mat 

community 


4440QR4 


fn I lorrorn Monro D 1 mm 

VJ uci I d \j iNGyiu l* — I mini 


I— K/norcal i no rti lorrn 
i i y [Jci odi 1 1 1 1; yucMU 

negro 


12 213 


R RQfi 197 


Mat community | 4440965 | Guerrero Negro 2-3 mm | Hypersaline guerro negro | 12,407 | 8,286,254
Mat community | 4440966 | Guerrero Negro 3-4 mm | Hypersaline guerro negro | 12,821 | 8,214,974
Mat community | 4440967 | Guerrero Negro 4-5 mm | Hypersaline guerro negro | 15,652 | 9,803,688


Mat 

community 




UUcllclU iNcLJIU IU ZZIIIIll 


nypcibdinic yuciiu 
negro 




lO COO 
I Z,OoO 


0,U 1 O, UO^f 


Mat

community 




uUclfclU INcLJIU O Dllllll 


nyptM bdiiMe yuyiiu 
negro 




19 R9R 
I z,ozo 




Mat 
IVId L 

community 




UUcllclU iNcLjlU D IU HUM 


nyptji bdinifc; yuyiiu 
negro 




1 R OAS 


C3,OUO,U I □ 


Mat community | 4440971 | Guerrero Negro 22-34 mm | Hypersaline guerro negro | 12,522 | 8,382,531
Mat community | 4440972 | Guerrero Negro 34-49 mm | Hypersaline guerro negro | 11,627 | 7,240,219


Open water 


AAA 1 nc 1 

444 lUb I 


HOT/ALOHA - Upper Euphotic 10m


Hot/aloha




7,837 


"7 A OO 1 1 C 

/,4oz, I I b 


Open water 


a a a -\ n sz rz 

4441 (Job 


HOT/ALOHA - Base of Chlorophyll Maximum 130m


Hot/aloha 




6,797 


b,0y1 , /4U 


Open water 


AAA mC7 

444 lUb/ 


HOT/ALOHA - Upper Euphotic 70m


Hot/aloha




ii),yyz 


1 n 000 one 
IU,ozo,obb 


Open water 


444 II zb 


GS040 - Open Ocean -Tropical South Pacific 


Global ocean sampling


736 


772,365 


Open water 


A A All OC 
444 1 1 ZD 


GS041 — Open Ocean —Tropical South Pacific 


Global ocean sampling


C~7Q 

b/o 


"70Q QEQ 

/oy,ybo 


Open water 


A A A 1 1 0~7 
444 I I Z / 


GS042 — Open Ocean —Tropical South Pacific 


Global ocean sampling


CQQ 

byy 


"7QQ /ICC 

/oo,4bb 


Open water 


AAA"\-\ OQ 
444 I I zo 


GS043 — Open Ocean —Tropical South Pacific 


Global ocean sampling


"711 

/ I I 


"7QQ /ICQ 

/oy,4bo 


Open water 


A A A 1 1 OQ 

444 I IZ3 


GS044 - Open Ocean -Tropical South Pacific 


Global ocean sampling


C~7Q 

b/o 


"71/1 OIO 
/ I 4,0 I O 


Open water 


444 I loU 


GS045 - Open Ocean -Tropical South Pacific 


Global ocean sampling


730 


~70C "700 

/yb, /ao 


Open water 


A A A 1 1 O 1 

444 I I o I 


GS046 - Open Ocean -Tropical South Pacific 


Global ocean sampling


COC 

bzb 


CQO OAC\ 

boo,z4U 


Open water 


A A A 1 1 O A 

444 I I o4 


GS110b — Open Ocean — Indian Ocean — 


Global ocean sampling


49,597 


rro ennn 
bo,b(J/,z/ / 


Open water 


A A A 1 1 

444 I loo 


GS120 — Open Ocean — Indian Ocean — Madagascar 


Global ocean sampling


A C HKO 

4b,Ubz 


A£. "71 O 1 DC 

4b, / iu, I yb 


Open water 


444 I loo 


GS039 - Open Ocean -Tropical South Pacific 


Global ocean sampling


759 


obb, /yb 


Open water 


AAA"\ 1 OQ 

444 I loy 


GS122b — Open Ocean Madagascar and South Africa 


Global ocean sampling


co noc 
bu,uyb 


bz, bb/,o4o 


Open water 


444 I 1 4b 


GS037 - Open Ocean - Eastern Tropical Pacific 


Global ocean sampling


65,670 


CO CC1 A "70 

bo,bb 1 ,4/o 


Open water 


A A A 1 1 /I C 

444 I 1 4b 


GS047 - Open Ocean -Tropical South Pacific 


Global ocean sampling


cc mo 
bb,l)zo 


co 0 a n occ 
bo,o4L),Zbb 


Open water 


444 I 1 4 / 


GS112b — Open Ocean - Indian Ocean 


Global ocean sampling


52,118 


cc coo qo/i 
bb,boo,oy4 


Open water 


A A A 1 1 /i o 
444 I I4a 


GS116 - Open Ocean — Indian Ocean 


Global ocean sampling


en ooo 

bu,yoz 


a A O OO A A 1 

b4,zzo,44/ 


Open water 


444 I I bU 


GS115 - Open Ocean — Indian Ocean 


Global ocean sampling


b 1 ,Uzl) 


c/i oon nco 
b4,zoU,L)bz 


Open water 


A A A 1 1 C 1 

444 I I b I 


GS119 - Open Ocean — Indian Ocean 


Global ocean sampling


60,987 


bb,Ubb,o/4 


Open water 


444 I I bb 


GS109 — Open Ocean — Indian Ocean 


Global ocean sampling


co on 
by,o I o 


CO "7CO O/l Q 

bz, /bz,o4y 


Open water | 4441156 | GS111 - Open Ocean - Indian Ocean | Global ocean sampling | 59,080 | 62,072,289
Open water | 4441570 | GS000a - Open Ocean - Sargasso Sea | Global ocean sampling | 644,551 | 658,755,696
Open water | 4441573 | GS000b - Open Ocean - Sargasso Sea | Global ocean sampling | 317,180 | 321,026,307
Open water | 4441574 | GS000c - Open Ocean - Sargasso Sea | Global ocean sampling | 368,835 | 371,688,861
Open water | 4441575 | GS000d - Open Ocean - Sargasso Sea | Global ocean sampling | 332,240 | 335,939,509


Open water 


A A A i C~Jf 

444 1 b /b 


GS001a - Open Ocean - Sargasso Sea


Global ocean sampling 


-i /io o co 

14z,3bz 


-l/IO Oil" A AC) 

143,3 lb, 44o 


Open water 


A A A 1 C77 

444 lb// 


GS001b - Open Ocean - Sargasso Sea


Global ocean sampling 


on om 

yu,yu I 


on nci ooo 

yu,yb i ,zyy 


Open water 


A A A 1 mo 

444 lb /a 


GS001c - Open Ocean - Sargasso Sea


Global ocean sampling 


92,351 


no i-ofi ncn 

yz,bo»,ybo 


Open water 


A A A 1 CO"7 

444 I bo / 


GS017 — Open Ocean -Yucatan Channel — Mexico 


Global ocean sampling 


257,581 


OQ -1 O CO ooc 

zo I ,zby,Jzb 


Open water 


A A A 1 CQQ 

444 I boo 


GS018 — Open Ocean - Rosario Bank — Honduras 


Global ocean sampling 


1 A O 1A O 

I4z, /4o 


-i a ia nm 

1 bb,4/4,yyz 


Open water 


AAA'] COO 

444 i byz 


GS022 — Open Ocean — Eastern Tropical Pacific 


Global ocean sampling 


I z I ,bbz 


1 q 1 mo oin 
I o I ,u/y,z /U 


Open water 


A A A 1 no A 

444 I b»4 


GS026 - Open Ocean — Galapagos Islands 


Global ocean sampling 


■ino 7no 
lUz, /(Jo 


inn nyio om 

iuy,u4y,oy / 


Open water 


AAA'] CH7 

444 I bU / 


GS110a — Open Ocean — Indian Ocean 


Global ocean sampling 


QQ OQQ 

yy,zoo 


1 nn no7QQ 1 
IUU,Uy/,oo I 


Open water 


A A A 1 COO 

444 I bUa 


GS110a — Open Ocean — Indian Ocean 


Global ocean sampling 


99,781 


-i n-i o 1 o ceo 
IUI ,o lo,bby 


Open water 


A A A 1 c 1 n 

444 I b I (J 


GS110a — Open Ocean — Indian Ocean 


Global ocean sampling 


a no 7on 
lUy, /L)L) 


a a o ooo 1 c a 
I lo,Jjy, Ib4 


Open water 


yl /I /1 1 C -1 1 

444 I b II 


GS110a — Open Ocean — Indian Ocean 


Global ocean sampling 


O A O OOO 

34o,ozo 


O/IC OOC C70 

J4b,zob,b/y 


Open water | 4441614 | GS110a - Open Ocean - Indian Ocean | Global ocean sampling | 110,720 | 119,426,081
Open water | 4441615 | GS110a - Open Ocean - Indian Ocean | Global ocean sampling | 101,558 | 105,196,135
Open water | 4441616 | GS110a - Open Ocean - Indian Ocean | Global ocean sampling | 107,966 | 115,611,614
Open water | 4441661 | GS023 - Open Ocean - Eastern Tropical Pacific | Global ocean sampling | 133,051 | 143,626,589


Open water 


A A A Q 7/1 n 
444o /4U 


I A_o4oOO 


Sargasso sea 
bacterioplankton 


y4,oo i 


i o,u / o,yoy 


Coral reef 
water 


AzlA 1191 
444 I I Z I 


GS050 — Coral Atoll — Tikehau Lagoon — Fr. Polynesia 


Global ocean sampling 


71 R 
/ I D 


7RR A9Q 


Coral reef 
water 


4-44 Moo 


GS108b _ Lagoon Reef — Coccos Keeling, Inside 
Lagoon - Australia 


Global ocean sampling 


AQ RQR 

4a,oao 


co con -i o/i 
Oo,OoU, I Z4 


Coral reef 
water 


A A ATI QQ 

444 1 i jy 


GS108a — Lagoon Reef Coccos Keeling, Inside 
Lagoon - Australia 


Global ocean sampling 


c 1 7DQ 

b I , /bo 


cn oon ccd 
bU,oyu,bbo 


Coral reef
water


4441 1R7 
4-44 MO/ 


GS048b - Coral reef - Moorea, Cooks Bay - Fr.
Polynesia


Global ocean sampling


47692 


Cfl QfiQ yl /I O 
0U,3DiJ,440 


Coral reef 
water 


444 I O3o 


GS025 — Fringing reef — Dirty Rock, Cocos 
Island - Costa Rica 


Global ocean sampling 


1 90 R71 
I ZU,D / I 


1 9Q 7P1 9QQ 

i zy, / o i ,zyy 


Coral reef 
water 


A A AT cm 
444 lOUo 


GS048a — Coral reef — Moorea, Cooks Bay — Fr. 
Polynesia 


Global ocean sampling 


an ri r 
yu,o i o 


yZ,o lo,oU4 


Coral reef 
water 


A A AT fin/I 


GS051 - Coral reef Atoll - Rangirora Atoll - Fr.
Polynesia


Global ocean sampling 


1 9Q QQO 

i Zo,yoz 


T AC\ A Q7Q1 9 
1 4U,4y /,o I Z 


Coral reef 
water 


/I /I /I 1 £11 7 
444 I O I / 


GS148 — Fringing Reef East coast ZanzibarTanzania 


Global ocean sampling 


1 n 7 7/i 1 

I U /, /4 I 


1 fl7R1 R oic 
1 U /,D 1 D,Z ID 


Coral reef water | 4442642 | King14LIMic20070829 | Northern line islands | 108,029 | 31,667,620


Coral reef 
water 


4442643 


King2LIMic20070817 


Northern line islands 
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